Introduction to Open Data Science - Course Project

About the project

The course looks fantastic. I have plenty of experience in R, Rstudio, git, and GitHub, which are the main tools of my daily work. With this course I hope to not only get hands-on experience with the selection and application of various statistical techniques, but also gain a better outlook on best practices for visualizing and sharing data.

date()
## [1] "Fri Dec 03 11:07:27 2021"

Github repository: https://github.com/SchahzadSaqib/IODS-project


Exercise 2: Data analysis (Regression and model validation)

date()
## [1] "Fri Dec 03 11:07:27 2021"

1. Data exploration

library(tidyverse)

#' read in data for analysis
lrn14 <- read.csv(here::here("data", 
                             "lrn14_summ.csv")) %>%
  dplyr::select(gender, everything()) #' reorder columns

#' take a look at the data
dplyr::glimpse(lrn14)
## Rows: 166
## Columns: 7
## $ gender   <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "M", "F", "~
## $ Age      <int> 17, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 2~
## $ Attitude <dbl> 1.7, 3.2, 3.7, 3.8, 1.9, 2.0, 3.2, 3.6, 3.7, 4.0, 1.9, 1.9, 2~
## $ Points   <int> 26, 17, 18, 28, 23, 20, 24, 30, 18, 27, 12, 12, 21, 20, 19, 2~
## $ deep     <dbl> 3.916667, 4.500000, 3.166667, 3.166667, 3.250000, 4.083333, 4~
## $ stra     <dbl> 4.625, 3.375, 2.625, 4.000, 3.875, 3.375, 3.625, 2.625, 2.625~
## $ surf     <dbl> 3.416667, 3.166667, 3.416667, 2.250000, 3.000000, 2.833333, 2~

The data comes from the international survey of Approaches to Learning, supported by the Teachers’ Academy funding for KV in 2013-2015. The data consists of 166 individuals and 7 variables. There are two descriptive variables, age and gender, and five variables that represent the cumulative results of the questionnaires in the survey: “Points” (total exam points for each individual), “Attitude” (global attitude towards statistics, the cumulative sum of 10 questions scored from 1 to 5), “deep” (a set of questions targeting the deep approach to learning), “stra” (strategic approach to learning), and “surf” (surface approach to learning).

Further details and explanations regarding this dataset can be found here

2. Graphical overview and summaries

library(GGally)

#' plot graphical summary of variables
GGally::ggpairs(lrn14, 
                mapping = aes(col = gender, 
                              fill = gender,
                              alpha = 0.2), 
                lower = list(combo = GGally::wrap("facethist", 
                                                  bins = 20))) +
  #' assign fill and colour
  viridis::scale_colour_viridis(option = "G",
                                end = 0.8,
                                discrete = T) +
  viridis::scale_fill_viridis(option = "G",
                              end = 0.8,
                              discrete = T)

From the descriptive variables we can observe that females made up a higher proportion of the participants and that most participants were roughly 20-25 years old. Two pairs of variables stand out as correlated: 1) Points and Attitude (0.422), showing that higher exam points are associated with a more positive attitude towards statistics, and 2) deep and surf (-0.324), showing that deep learning approaches are negatively correlated with surface approaches.
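The pairwise coefficients quoted above can be recomputed directly with cor(); a quick sketch, assuming the lrn14 data frame loaded earlier:

```r
#' recompute the pairwise Pearson correlations discussed above
#' (assumes the lrn14 data frame from the previous chunk)
round(cor(lrn14[, c("Points", "Attitude", "deep", "surf")]), 3)
```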

Regression model and validation

Using exam points (Points) as our dependent variable, we fit a linear model to determine which independent variables are most important in explaining the distribution of exam points in the dataset. Attitude is the first choice, since it is highly correlated with exam points. We also add the variables surf and stra to the model, since they are marginally correlated with Points as well.

#' linear model
lm_mdl <- stats::lm(formula = Points ~ Attitude + surf + stra, 
                    data = lrn14)

#' model summary
summary(lm_mdl)
## 
## Call:
## stats::lm(formula = Points ~ Attitude + surf + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## Attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

As expected, Attitude is statistically significant (p-value 1.93e-08). The variables surf and stra, however, do not seem to be significant. This model explains ~19% (adjusted R² = 0.1927) of the variance in exam points. Next, we remove the non-significant variables from the model.

#' linear model
lm_mdl <- stats::lm(formula = Points ~ Attitude, 
                    data = lrn14)

#' summary
summary(lm_mdl)
## 
## Call:
## stats::lm(formula = Points ~ Attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Our new model still explains ~19% of the variance (adjusted R² = 0.1856), and Attitude remains statistically significant.

Diagnostic plots

Next, we validate our model by drawing some diagnostic plots.

#' diagnostic plots
par(mfrow = c(3, 1))
plot(lm_mdl, which = c(1,2,5))

Residuals vs Fitted plot: This plot illustrates the patterns of the residuals. When using a linear model, it is crucial to check that the residuals do not exhibit any non-linear patterns, which would require an alternative model to capture. The residuals look fairly evenly scattered in our plot, indicating that the linear model was an appropriate choice.

Normal Q-Q plot: This plot illustrates whether the residuals are normally distributed. They should roughly follow the diagonal line, with some room for deviations at the tails. These residuals seem normally distributed, with a few potential outliers.

Residuals vs Leverage plot: This plot indicates whether there are standout samples that are highly influential on the overall distribution and regression results. Such cases/samples would cross the dashed lines in the upper and lower right corners of the plot, which represent Cook's distance. No such instances are observed in our plot, which means there are no highly influential samples present in our data.
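Beyond the plot, Cook's distances can also be checked numerically; a small sketch, assuming the lm_mdl and lrn14 objects from above and using the common 4/n rule of thumb as a cutoff:

```r
#' Cook's distances for the fitted model (assumes lm_mdl from above)
cooks_d <- cooks.distance(lm_mdl)

#' a common rule-of-thumb cutoff: 4 / number of observations
cutoff <- 4 / nrow(lrn14)

#' indices of potentially influential observations, if any
which(cooks_d > cutoff)
```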


Exercise 3: Data analysis (Logistic regression)

This data set comes from a Portuguese study titled “Using data mining to predict secondary school student performance”. The data is collected from two Portuguese secondary schools and consists of descriptive attributes of students’ personal lives as well as their performance in school, measured by their average grade. Further details about the study and the data set can be seen here and here.

Let’s take a look at the data itself:

#' load main library
library(tidyverse)

#' read in data
stu_aggr <- read.csv(here::here("data", 
                                "stu_aggr.csv"))

#' check the structure and dimensions
dplyr::glimpse(stu_aggr)
## Rows: 370
## Columns: 47
## $ school       <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP~
## $ sex          <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F~
## $ age          <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 1~
## $ address      <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U~
## $ famsize      <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "~
## $ Pstatus      <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T~
## $ Medu         <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, ~
## $ Fedu         <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, ~
## $ Mjob         <chr> "at_home", "at_home", "at_home", "health", "other", "serv~
## $ Fjob         <chr> "teacher", "other", "other", "services", "other", "other"~
## $ reason       <chr> "course", "course", "other", "home", "home", "reputation"~
## $ guardian     <chr> "mother", "father", "mother", "mother", "father", "mother~
## $ traveltime   <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, ~
## $ studytime    <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, ~
## $ failures.mat <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ~
## $ schoolsup    <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", ~
## $ famsup       <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes~
## $ paid.mat     <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes"~
## $ activities   <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "~
## $ nursery      <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "y~
## $ higher       <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "~
## $ internet     <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes~
## $ romantic     <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "n~
## $ famrel       <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, ~
## $ freetime     <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, ~
## $ goout        <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, ~
## $ Dalc         <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ~
## $ Walc         <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, ~
## $ health       <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, ~
## $ absences.mat <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 1~
## $ G1.mat       <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14~
## $ G2.mat       <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14~
## $ G3.mat       <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 1~
## $ failures.por <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ~
## $ paid.por     <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no~
## $ absences.por <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2,~
## $ G1.por       <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14,~
## $ G2.por       <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 1~
## $ G3.por       <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 1~
## $ G1           <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 1~
## $ G2           <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, ~
## $ G3           <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16,~
## $ absences     <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, ~
## $ failures     <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ~
## $ paid         <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes"~
## $ alc_use      <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.~
## $ high_use     <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~

As we can see, there are 370 students, with 47 characteristics defined for each one. The most important are G3, which represents the final grade, and high_use, which is a measure of weekly alcohol intake. Our research question is whether increased alcohol intake is associated with the students’ overall performance and/or significantly associated with certain other variables.

Let’s first take a look at the distributions of the variables:

#' removing duplicated and aggregated columns 
stu_aggr_clean <- stu_aggr %>%
  dplyr::select(!ends_with(c(".mat", ".por")), 
                -Dalc, 
                -Walc)

#' plot distributions
plot_vars <- stu_aggr_clean %>%
  gather() %>%
  ggplot2::ggplot(., aes(value, fill = key)) +
  ggplot2::facet_wrap(key ~., scales = "free") +
  ggplot2::geom_bar(show.legend = F) + 
  ggplot2::theme_minimal() +
  viridis::scale_fill_viridis(option = "F", 
                              discrete = T)
plot_vars

Looking at these distributions, we can select four variables that could potentially play a role in increased alcohol intake and see whether they come out as statistically significant. Let’s pick the following:

goout: The frequency of going out with friends (1 - very low to 5 - very high).

sex: student’s sex (M - male, F - female)

famrel: quality of family relationships (1 - very bad to 5 - very good)

paid: extra paid classes within the course subject (Math or Portuguese)

vars_sel <- c("goout", 
              "famrel",
              "sex", 
              "paid")

to.plot <- stu_aggr_clean %>%
  dplyr::select(all_of(vars_sel[1:2]), high_use) %>%
  tidyr::pivot_longer(cols = -high_use) %>%
  dplyr::arrange(name) %>%
  dplyr::mutate(name = factor(name, levels = unique(name)))


ggplot2::ggplot(to.plot, aes(x = high_use, 
                             y = value, 
                             fill = high_use)) +
  ggplot2::facet_wrap(name ~., scales = "free") +
  ggplot2::geom_jitter(aes(colour = high_use), 
                       show.legend = F) +
  ggplot2::geom_boxplot(show.legend = F) +
  viridis::scale_fill_viridis(option = "G",
                                begin = 0.4,
                                end = 0.95,
                                discrete = T) +
   viridis::scale_colour_viridis(option = "G",
                                begin = 0.4,
                                end = 0.95,
                                discrete = T) +
  ggplot2::theme_minimal()

to.plot <- stu_aggr_clean %>%
  dplyr::select(all_of(vars_sel[3:4]), high_use) %>%
  tidyr::pivot_longer(cols = -high_use) %>%
  dplyr::arrange(name) %>%
  dplyr::mutate(name = factor(name, levels = unique(name)))

ggplot2::ggplot(to.plot, aes(x = high_use)) +
  ggplot2::facet_wrap(name ~ ., scales = "free") +
  ggplot2::geom_bar(aes(fill = value)) +
  viridis::scale_fill_viridis(option = "G", 
                              discrete = T, 
                              begin = 0.4, 
                              end = 0.9) +
  ggplot2::theme_minimal()

#' summarise counts and mean final grade per level of each variable
for (sub in seq_along(vars_sel)) {
  to.summ <- stu_aggr_clean %>%
    dplyr::group_by(high_use, !!!syms(vars_sel[sub])) %>%
    dplyr::summarise(n = n(), mean_G3 = mean(G3))
  print(to.summ)
}
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 10 x 4
## # Groups:   high_use [2]
##    high_use goout     n mean_G3
##    <lgl>    <int> <int>   <dbl>
##  1 FALSE        1    19    10.7
##  2 FALSE        2    82    12.2
##  3 FALSE        3    97    12.2
##  4 FALSE        4    40    11.5
##  5 FALSE        5    21    10.1
##  6 TRUE         1     3    11.3
##  7 TRUE         2    15    11.9
##  8 TRUE         3    23    10.9
##  9 TRUE         4    38    11.1
## 10 TRUE         5    32    10.0
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 10 x 4
## # Groups:   high_use [2]
##    high_use famrel     n mean_G3
##    <lgl>     <int> <int>   <dbl>
##  1 FALSE         1     6    11.8
##  2 FALSE         2     9    11.2
##  3 FALSE         3    39    11.4
##  4 FALSE         4   128    11.7
##  5 FALSE         5    77    12.2
##  6 TRUE          1     2    12  
##  7 TRUE          2     9    10.7
##  8 TRUE          3    25    10.8
##  9 TRUE          4    52    11.1
## 10 TRUE          5    23    10.2
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   high_use [2]
##   high_use sex       n mean_G3
##   <lgl>    <chr> <int>   <dbl>
## 1 FALSE    F       154    11.4
## 2 FALSE    M       105    12.3
## 3 TRUE     F        41    11.8
## 4 TRUE     M        70    10.3
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   high_use [2]
##   high_use paid      n mean_G3
##   <lgl>    <chr> <int>   <dbl>
## 1 FALSE    no      140    11.4
## 2 FALSE    yes     119    12.3
## 3 TRUE     no       56    10.8
## 4 TRUE     yes      55    11.0

As we can see from the plots and tables, poor family relations and a high frequency of going out with friends are associated with increased alcohol consumption. Males were also more likely to consume more alcohol than females. Extra paid classes do not seem to show any clear trend, but the effect may become clearer within the model.

Let’s use all four variables in a logistic regression model with high_use as the response variable:

#' fit model
glm.1 <- stats::glm(high_use ~ sex + goout + famrel + paid, 
                    data = stu_aggr_clean, 
                    family = "binomial")

#' summarise
summary(glm.1)
## 
## Call:
## stats::glm(formula = high_use ~ sex + goout + famrel + paid, 
##     family = "binomial", data = stu_aggr_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6862  -0.7894  -0.5213   0.8510   2.5063  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4751     0.6680  -3.705 0.000211 ***
## sexM          1.0333     0.2596   3.981 6.88e-05 ***
## goout         0.8022     0.1231   6.517 7.20e-11 ***
## famrel       -0.4350     0.1404  -3.098 0.001950 ** 
## paidyes       0.3161     0.2562   1.234 0.217272    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 377.58  on 365  degrees of freedom
## AIC: 387.58
## 
## Number of Fisher Scoring iterations: 4
#' extract coefficients
coef(glm.1)
## (Intercept)        sexM       goout      famrel     paidyes 
##  -2.4751385   1.0333396   0.8021770  -0.4349508   0.3161099
#' extract Odds ratios
OR <- coef(glm.1) %>%
  exp

#' extract confidence intervals
CI <- confint(glm.1) %>%
  exp
## Waiting for profiling to be done...
#' create table
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.08415133 0.02186678 0.3024711
## sexM        2.81043581 1.70049874 4.7152163
## goout       2.23039118 1.76438158 2.8618618
## famrel      0.64729650 0.48962135 0.8506400
## paidyes     1.37178094 0.83155189 2.2751637

As expected, sex (male), going out with friends, and family relations are important factors in explaining increased alcohol consumption. There is a positive relation between high_use and going out, and a negative relation between high_use and better family relations.

The odds of an event are the probability of success divided by the probability of failure, and an odds ratio compares these odds between groups. Once again, these show higher odds of heavy alcohol consumption for males, for going out more often, and for worse family relations.
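To make the arithmetic concrete (an illustrative sketch with a made-up probability, not a value from this data): odds are p/(1 - p), and exponentiating a logistic regression coefficient converts it from the log-odds scale to an odds ratio:

```r
#' odds from an illustrative probability
p <- 0.75
p / (1 - p)
## [1] 3

#' a coefficient on the log-odds scale becomes an odds ratio when
#' exponentiated, e.g. the goout estimate from the model above
exp(0.8021770)
## ~2.2304, matching the OR table
```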

Next, we predict the outcome (high_use) using our model and assess its accuracy in assigning the correct classes.

#' predict and subset
stu_aggr_wprobs <- stu_aggr_clean %>%
  dplyr::mutate(probs = predict(glm.1, type = "response"), 
                preds = probs > 0.5) %>%
  dplyr::select(high_use, !!!vars_sel, probs, preds)
tail(stu_aggr_wprobs, 10)
##     high_use goout famrel sex paid      probs preds
## 361    FALSE     3      4   M   no 0.31538311 FALSE
## 362    FALSE     2      4   M   no 0.17118555 FALSE
## 363     TRUE     3      5   M  yes 0.29030314 FALSE
## 364    FALSE     3      5   F  yes 0.12705507 FALSE
## 365    FALSE     3      4   F  yes 0.18357661 FALSE
## 366    FALSE     2      5   F   no 0.04541047 FALSE
## 367    FALSE     4      4   F  yes 0.33400549 FALSE
## 368    FALSE     1      1   F   no 0.10833016 FALSE
## 369     TRUE     5      2   M   no 0.84542817  TRUE
## 370     TRUE     1      4   M   no 0.08475514 FALSE
#' confusion matrix
table(high_use = stu_aggr_wprobs$high_use, 
      preds = stu_aggr_wprobs$preds)
##         preds
## high_use FALSE TRUE
##    FALSE   235   24
##    TRUE     62   49

Let’s now determine the predictive power of our model compared to simply guessing the outcome. To do this we calculate the proportion of incorrectly classified outcomes under different fixed probabilities. This serves as a penalty: the lower the value, the better.

#' defining the loss function
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

#' probability of high use is 1 for each individual
loss_func(class = stu_aggr_wprobs$high_use,
          prob = 1)
## [1] 0.7
#' probability of high use is 0 for each individual
loss_func(class = stu_aggr_wprobs$high_use,
          prob = 0)
## [1] 0.3
#' probabilities of high use are taken from the model for each 
#' individual
loss_func(class = stu_aggr_wprobs$high_use,
          prob = stu_aggr_wprobs$probs)
## [1] 0.2324324

So, our model has a lower penalty than simply guessing either outcome.

Next, let’s see how our model performs on randomized subsets of our data, once again using the proportion of incorrectly classified outcomes as the measure of performance.

#' load library for cross-validation
library(boot)

#' 10-fold cross-validation
stu_aggr_cv <- boot::cv.glm(data = stu_aggr_wprobs, 
                            cost = loss_func, 
                            glmfit = glm.1, 
                            K = 10)

stu_aggr_cv$delta[1]
## [1] 0.2378378

The prediction error is slightly higher, but the model still performs better than guessing.
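Since the 10 folds are assigned at random, the cross-validated error varies from run to run; a sketch (assuming glm.1, loss_func, and stu_aggr_wprobs from above) that averages several repeats for a more stable estimate:

```r
#' repeat 10-fold cross-validation and average the prediction errors
#' (assumes glm.1, loss_func, and stu_aggr_wprobs from above)
cv_errors <- replicate(10, {
  boot::cv.glm(data = stu_aggr_wprobs, 
               cost = loss_func, 
               glmfit = glm.1, 
               K = 10)$delta[1]
})
mean(cv_errors)
```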


Exercise 4: Clustering and classification

This week’s exercise is all about visually exploring statistical data. The exercise is based on the Boston dataset provided with the MASS R package.

First, load the MASS library and read in the data:

library(tidyverse)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
BSTN <- MASS::Boston
glimpse(BSTN)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,~
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1~
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.~
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,~
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,~
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9~
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505~
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,~
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31~
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15~
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90~
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10~
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15~

The data consists of 506 entries for 14 variables describing housing values in different suburbs of Boston, along with air quality and occupants’ willingness to pay for clean air. A detailed description of each variable can be seen here

summary(BSTN)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

It seems like there is a lot of variation between different towns/suburbs, with the widest ranges in per-capita crime rate (min: 0.0063, max: 88.98), age (min: 2.9, max: 100), and the proportion of residential land zoned for lots over 25,000 sq. ft. (min: 0, max: 100).

Now let’s look at the correlations between these variables:

#' create correlation matrix
BSTN_cor <- cor(BSTN) %>%
  round(digits = 2)

#' visualize
corrplot::corrplot(BSTN_cor, 
                   type = "upper", 
                   tl.pos = "d")

As can be observed from the plot, there is high correlation between many of the variables, the most striking being the positive correlation of nox vs age, the negative correlation of nox vs dis, and the positive correlation of rad vs tax.

To account for the wide ranges of certain variables and to keep them comparable, let’s scale the variables so that each has mean 0 and standard deviation 1.

BSTN_scale <- as.data.frame(scale(BSTN))
summary(BSTN_scale)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
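As a sanity check, scale() simply subtracts each column mean and divides by the column standard deviation; a small self-contained sketch on a single variable:

```r
#' manual standardization of one variable
x <- MASS::Boston$crim
manual <- (x - mean(x)) / sd(x)

#' identical (up to floating point) to scale()
all.equal(as.numeric(scale(x)), manual)
## [1] TRUE
```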

Next, let’s recode the scaled crime rate into a categorical variable:

#' generate quantiles
crime_qtls <- quantile(BSTN_scale$crim)

#' create categorical variable
BSTN_scale_mod <- BSTN_scale %>%
  dplyr::mutate(crime = cut(crim, 
                            breaks = crime_qtls, 
                            include.lowest = T, 
                            labels = c("low",
                                       "mid_low", 
                                       "mid_high", 
                                       "high"))) %>%
  dplyr::select(-crim)

With crime as the variable of interest, we divide the data set into training and test sets:

#' randomly extract indexes for 80% of the data
sub <- sample(nrow(BSTN_scale_mod), 
              size = nrow(BSTN_scale_mod) * 0.8)

#' subset the training set (80%)
train <- BSTN_scale_mod[sub,]

#' subset the test set (20%)
test <- BSTN_scale_mod[-sub,]

#' save correct classes
crt <- test$crime

#' remove crime from test data set
test <- test %>%
  dplyr::select(-crime)

Next, we perform linear discriminant analysis (LDA) on the training set. LDA is a classification method that can also be used as a dimensionality reduction technique: it finds linear combinations of the features/variables that best separate the classes.

#' LDA
BSTN_lda <- MASS::lda(crime ~ ., data = train)
BSTN_lda
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##      low  mid_low mid_high     high 
## 0.259901 0.240099 0.259901 0.240099 
## 
## Group means:
##                  zn      indus        chas        nox         rm        age
## low       1.0348982 -0.9197802 -0.12234430 -0.8997404  0.4684579 -0.9298066
## mid_low  -0.1146069 -0.2453174  0.01179157 -0.5350933 -0.1108013 -0.3052129
## mid_high -0.3794348  0.1103209  0.17762524  0.3243173  0.0918388  0.3735702
## high     -0.4872402  1.0172187  0.01179157  1.0718359 -0.4373125  0.8398408
##                 dis        rad        tax     ptratio      black        lstat
## low       0.9238427 -0.6909262 -0.7214972 -0.47963828  0.3768695 -0.793148242
## mid_low   0.3029008 -0.5343243 -0.4681280 -0.05327035  0.3151947 -0.137244799
## mid_high -0.3320558 -0.3857621 -0.3098880 -0.25484432  0.1048269 -0.005563586
## high     -0.8569352  1.6371072  1.5133254  0.77958792 -0.7622250  0.900329437
##                   medv
## low       0.5203038770
## mid_low   0.0009166787
## mid_high  0.1684332832
## high     -0.6605420910
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08822181  0.84339651 -0.76731376
## indus   -0.02303195 -0.26733247  0.44886061
## chas    -0.05790445 -0.05886530  0.07567103
## nox      0.43386681 -0.54050924 -1.55904267
## rm      -0.09850735 -0.06472423 -0.15348683
## age      0.25654789 -0.28719825 -0.17421549
## dis     -0.08756626 -0.22615601  0.02444811
## rad      3.01699904  0.80988189  0.17764119
## tax     -0.06224867  0.12039532  0.32972045
## ptratio  0.12295211  0.04893984 -0.27694434
## black   -0.12552205 -0.01878472  0.11383890
## lstat    0.23775802 -0.28463889  0.32446697
## medv     0.20054721 -0.39729101 -0.25681264
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9432 0.0440 0.0128
#' define function for lda arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, 
                       color = "orange", tex = 0.75, 
                       choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[, choices[1]], 
         y1 = myscale * heads[, choices[2]], 
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads), 
       cex = tex, col = color, pos = 3)
}

#' target classes
tr_crt <- as.numeric(train$crime)

#' plot LDA
plot(BSTN_lda, 
     col = tr_crt, 
     dimen = 2, 
     pch = tr_crt)
lda.arrows(BSTN_lda, myscale = 2)

As we can see from the plot, the high class separates very well from the low and mid_low classes, and there is still some separation between mid_high and the overlapping low/mid_low categories. Furthermore, the variables rad, nox, and zn appear to be crucial for the separation.
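The arrow lengths mirror the absolute sizes of the discriminant coefficients printed above. As a quick sanity check (using the LD1 coefficients copied from the output, rounded to three decimals), rad indeed carries by far the largest weight:

```r
#' LD1 coefficients copied from the printed lda output (rounded)
ld1 <- c(zn = 0.088, indus = -0.023, chas = -0.058, nox = 0.434,
         rm = -0.099, age = 0.257, dis = -0.088, rad = 3.017,
         tax = -0.062, ptratio = 0.123, black = -0.126,
         lstat = 0.238, medv = 0.201)

#' the variable with the largest absolute weight drives LD1
names(which.max(abs(ld1)))
## [1] "rad"
```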

Now let's predict the classes on the test data set

#' predict the classes
BSTN_pred <- predict(BSTN_lda, 
                     newdata = test)

#' compare against the correct classes
table(correct = crt, 
      predicted = BSTN_pred$class)
##           predicted
## correct    low mid_low mid_high high
##   low        9      11        2    0
##   mid_low    5      20        4    0
##   mid_high   0       1       20    0
##   high       0       0        0   30

There were some classification errors, mostly in the low class, but predictions for the remaining classes were fairly accurate; the high class was classified perfectly.
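The table can also be summarized into a single number. A small sketch that rebuilds the confusion matrix from the printed values and computes the overall accuracy:

```r
#' confusion matrix rebuilt from the printed table (rows = correct, cols = predicted)
cls <- c("low", "mid_low", "mid_high", "high")
conf <- matrix(c(9, 11,  2,  0,
                 5, 20,  4,  0,
                 0,  1, 20,  0,
                 0,  0,  0, 30),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = cls, predicted = cls))

#' proportion of the 102 test observations classified correctly
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)
## [1] 0.775
```

Roughly 77% of the test observations land on the diagonal, with most of the errors concentrated in the low and mid_low classes.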

Next up, we compute the distances between the observations and perform k-means clustering

#' distances
BSTN_dist <- stats::dist(BSTN_scale)

#' summary
summary(BSTN_dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#' manhattan distance matrix
BSTN_dist_man <- dist(BSTN_scale, 
                      method = 'manhattan')

#' summary
summary(BSTN_dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
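As a reminder of what the two metrics measure, a toy check on the points (0, 0) and (3, 4): the Euclidean distance is the straight line sqrt(3² + 4²) = 5, while the Manhattan distance is the sum of coordinate differences 3 + 4 = 7:

```r
#' two points: straight-line distance 5, city-block distance 7
pts <- rbind(a = c(0, 0), b = c(3, 4))

as.numeric(stats::dist(pts))                       # Euclidean
## [1] 5
as.numeric(stats::dist(pts, method = "manhattan")) # Manhattan
## [1] 7
```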
#' k-means clustering (on the scaled data rather than the distance matrix)
BSTN_km <- stats::kmeans(BSTN_scale, 
                         centers = 3)

#' visualize
pairs(BSTN_scale, 
      col = BSTN_km$cluster)

Now let's determine the optimal k value for k-means clustering

# set.seed
set.seed(211)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(BSTN_scale, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering with the optimal k
km <- kmeans(BSTN_scale, 
             centers = 2)

# plot the Boston dataset with clusters
pairs(BSTN_scale, 
      col = km$cluster)

From the plot we can see that the most drastic drop in the total within-cluster sum of squares (WCSS) occurs at k = 2, which is indicative of the optimal k value.
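The elbow logic can be checked on simulated data with a known number of clusters. A minimal sketch (synthetic data, not part of the Boston analysis) with two well-separated groups, where the WCSS drop from k = 1 to k = 2 should dwarf all later drops:

```r
set.seed(42)

#' synthetic data: two well-separated 2-d clusters of 50 points each
toy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

#' total WCSS for k = 1..5
twcss_toy <- sapply(1:5, function(k) kmeans(toy, k, nstart = 10)$tot.withinss)

#' the largest drop is between k = 1 and k = 2, i.e. the elbow sits at 2
which.max(-diff(twcss_toy))
## [1] 1
```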

Super bonus

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(BSTN_lda$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% BSTN_lda$scaling
matrix_product <- as.data.frame(matrix_product)


plotly::plot_ly(x = matrix_product$LD1, 
                y = matrix_product$LD2, 
                z = matrix_product$LD3, 
                type = 'scatter3d', 
                mode = 'markers', 
                color = train$crime)

The plot looks broadly similar to the earlier LDA plot, with the high crime rate class clearly separated from the other groupings.
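Any remaining difference comes down to centering: MASS's predict.lda() centers the new data by the prior-weighted group means before applying the scaling matrix, while the matrix product above projects the raw predictors, shifting every point by the same constant. A sketch of this equivalence on the built-in iris data (an illustration of the MASS internals, not part of the exercise):

```r
library(MASS)

#' fit an LDA on iris as a stand-in for the Boston model
fit <- lda(Species ~ ., data = iris)
X   <- as.matrix(iris[, 1:4])

#' centre by the prior-weighted group means, then project
ctr  <- colSums(fit$prior * fit$means)
proj <- scale(X, center = ctr, scale = FALSE) %*% fit$scaling

#' identical (up to rounding) to the scores returned by predict()
max(abs(proj - predict(fit)$x))  # ~ 0
```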


Exercise 5: Dimensionality reduction techniques

For this exercise we will be looking at Human Development Index (HDI) indicators presented by the United Nations Development Programme to assess the development of countries. Details of the programme and explanations of the variables can be found here.

First, let's load the data:

library(tidyverse)

#' read in modified file
human_data <- read.csv(here::here("data", 
                                  "human.csv"), 
                       row.names = 1)

#' glimpse
dplyr::glimpse(human_data)
## Rows: 155
## Columns: 8
## $ Edu.sec.prop <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.9690608, 0.~
## $ Lbr_prop     <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.8286119, 0.~
## $ Edu.Y        <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9, 19.~
## $ Life.ExpAB   <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0, 81.~
## $ GNI.pc       <int> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 52947, 4~
## $ MMR          <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 11, 6,~
## $ Adl_brthR    <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3, 6.0~
## $ Prl_prc      <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2, 31.~
#' summary
base::summary(human_data)
##   Edu.sec.prop       Lbr_prop          Edu.Y         Life.ExpAB   
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##      GNI.pc            MMR           Adl_brthR         Prl_prc     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#' pairs
GGally::ggpairs(human_data)

From the summary we can observe large differences between countries, especially in GNI per capita. From the visualization we can further see that education and life expectancy at birth are strongly positively correlated with GNI, while maternal mortality rate and adolescent birth rate are negatively correlated with it.

Next, we carry out principal component analysis (PCA)

#' PCA
pca <- stats::prcomp(human_data)
biplot(pca, 
       choices = 1:2, 
       cex = c(0.8, 1), 
       col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Looks like most countries are clumped together, but a handful can be seen branching away towards the bottom and left. These might be extreme entries that deviate drastically from the average.
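The clumping is a scale artifact: prcomp() without standardization lets the variable with the largest variance (here GNI, measured in tens of thousands) dominate the first component. A toy sketch of the effect (synthetic data; x1 and x2 are hypothetical stand-ins):

```r
set.seed(1)

#' x2 tracks x1 but on a ~1000x larger scale (think GNI vs proportions)
x1  <- rnorm(100)
x2  <- 1000 * x1 + rnorm(100, sd = 200)
toy <- cbind(x1 = x1, x2 = x2)

#' unscaled PCA: PC1 is essentially just x2
r_raw <- abs(prcomp(toy)$rotation["x2", "PC1"])            # ~ 1

#' scaled PCA: both variables load evenly on PC1
r_std <- abs(prcomp(toy, scale. = TRUE)$rotation[, "PC1"]) # both ~ 0.71
```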

Now, let's standardize the variables and try again

#' scale the data
human_data_s <- scale(human_data)

#' glimpse
base::summary(human_data_s)
##   Edu.sec.prop        Lbr_prop           Edu.Y           Life.ExpAB     
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##      GNI.pc             MMR            Adl_brthR          Prl_prc       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
#' redo PCA analysis
pca2 <- stats::prcomp(human_data_s)
biplot(pca2, 
       choices = 1:2, 
       cex = c(0.8, 1), 
       col = c("grey40", "deeppink2"))

That makes things much clearer. The countries are now more evenly distributed and the reasons for separation are also much more obvious. The countries branching towards the left corner mostly deviate due to education status, while the ones branching towards the right corner are separated by maternal mortality rate and adolescent birth rate.
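For a more informative biplot, the variance explained by each component can be pulled from summary() and used in the axis labels. A sketch on the built-in iris measurements (standardized, as above):

```r
#' PCA on standardized data (iris used as a self-contained example)
p   <- prcomp(scale(iris[, 1:4]))
pct <- round(100 * summary(p)$importance["Proportion of Variance", ], 1)
pct
##  PC1  PC2  PC3  PC4 
## 73.0 22.9  3.7  0.5

#' paste the percentages into the axis labels of a biplot
labs <- paste0(names(pct), " (", pct, "%)")
# biplot(p, xlab = labs[1], ylab = labs[2])
```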

Moving things along. Tea! This data comes from the FactoMineR package and contains the answers of 300 individuals to a questionnaire on their tea consumption, product perception, and personal details. Details

library(FactoMineR)

#' load data
data("tea")

#' glimpse
dplyr::glimpse(tea)
## Rows: 300
## Columns: 36
## $ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.breakfast, b~
## $ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea time, N~
## $ evening          <fct> Not.evening, Not.evening, evening, Not.evening, eveni~
## $ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch~
## $ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.dinner, d~
## $ always           <fct> Not.always, Not.always, Not.always, Not.always, alway~
## $ home             <fct> home, home, home, home, home, home, home, home, home,~
## $ work             <fct> Not.work, Not.work, work, Not.work, Not.work, Not.wor~
## $ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, N~
## $ friends          <fct> Not.friends, Not.friends, friends, Not.friends, Not.f~
## $ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.resto, No~
## $ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,~
## $ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl G~
## $ How              <fct> alone, milk, alone, alone, alone, alone, alone, milk,~
## $ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar,~
## $ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag,~
## $ where            <fct> chain store, chain store, chain store, chain store, c~
## $ price            <fct> p_unknown, p_variable, p_variable, p_variable, p_vari~
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 31, 56, 6~
## $ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F,~
## $ SPC              <fct> middle, middle, other worker, student, employee, stud~
## $ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsman, sport~
## $ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-44, 35-4~
## $ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3 to 6/we~
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.escape-ex~
## $ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spirituality,~
## $ healthy          <fct> healthy, healthy, healthy, healthy, Not.healthy, heal~
## $ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diuretic, diure~
## $ friendliness     <fct> Not.friendliness, Not.friendliness, friendliness, Not~
## $ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not.iron ab~
## $ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not.feminin~
## $ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sophisticat~
## $ slimming         <fct> No.slimming, No.slimming, No.slimming, No.slimming, N~
## $ exciting         <fct> No.exciting, exciting, No.exciting, No.exciting, No.e~
## $ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing, relaxin~
## $ effect.on.health <fct> No.effect on health, No.effect on health, No.effect o~
#' clean the data
tea_time <- tea %>%
  dplyr::select(Tea,
                How,
                how,
                sugar,
                where,
                lunch)

#' summaries
base::summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
# visualize
ggplot(gather(tea_time), aes(value)) + 
  facet_wrap("key", scales = "free") + 
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

#' multiple correspondence analysis (MCA)
mca <- MCA(tea_time, graph = FALSE)

#' summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
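As a sanity check on the eigenvalues above: in MCA they sum to J/Q - 1, where J is the total number of categories (3 + 4 + 3 + 2 + 3 + 2 = 17 here) and Q the number of variables (6), and each "% of var." entry is just the eigenvalue divided by that sum:

```r
#' eigenvalues copied from the printed summary
evs <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
         0.144, 0.141, 0.117, 0.087, 0.062)

#' total inertia: J/Q - 1 = 17/6 - 1, ~ 1.83
c(sum(evs), 17 / 6 - 1)

#' percentages per dimension, matching the "% of var." row up to rounding
round(100 * evs / sum(evs), 2)[1:3]  # ~ 15.23 14.25 11.95
```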
#' visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

#' Biplot
factoextra::fviz_mca_biplot(mca, label = "var")

#' extra plots from the package
res.mca <- MCA(tea,
               quanti.sup= 19,
               quali.sup = 20:36, 
               graph = F)
plot(res.mca,
     invisible = c("ind", "quali.sup", "quanti.sup"),
     cex = 0.8)
## Warning: ggrepel: 30 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

From the plots it seems that Earl Grey is usually taken with both milk and sugar, while black and green tea are not. Tea bags tend to be bought from chain stores, while unpackaged tea is bought from tea shops.

The last plot overlays additional categorical variables that provide further insight. Earl Grey seems to be the tea of choice for breakfast, lunch, pubs, and with friends, while green tea is usually reserved for dinner and drinking in private.